Introduction

On many occasions, it is necessary to scan a text or an entire corpus for specific words or themes. For instance, a researcher may want to explore how often parties refer to “democracy” and related terms in their manifestos or in their speeches in parliament. They may also be interested in broader categories, such as different emotions or policy areas.

Here we present functions related to three procedures that are often used in text analysis to scan texts: contextualization, salience, and positioning. The first shows words in context, allowing users to see how they are used in sentences or paragraphs. It is particularly useful to determine the degree of ambiguity or polysemy of a word. The second shows the importance of words in a text or corpus, either by expressing their weight in a text or by comparing their frequency across different texts. It is particularly useful to select texts or to identify the documents that are most relevant to a specific topic. The third shows the position of words in a text or corpus, allowing users to see how they are distributed. It is particularly useful to see where in a document, or in which documents of a sequence, a word or theme is concentrated.

Contextualizing keywords using wordtree

The function wordtree employs a Google Charts visualization to show the context of a keyword in a text or corpus. It requires a corpus and a keyword as arguments and returns a visualization with the keyword in the center and the words that appear before and after it branching out on either side.


#  Load the packages
library(quanteda)

# Create a corpus with the inaugural speeches
# of Spanish presidents
cp <- corpus(spa.inaugural)

# Create a wordtree with the word "libertad"
wordtree(corpus = cp,
         keyword = "libertad",
         height = 800)

The chart enables users to click on particular words and explore how they relate to the keyword “libertad”. The graph also sizes words according to their frequency in the corpus. Those expressions appearing more frequently are larger than those appearing less.
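The idea behind contextualization can also be illustrated without a chart. The following is a minimal base R sketch of a keyword-in-context listing. The helper kwic_sketch and its naive tokenizer are hypothetical names introduced here for illustration only; they are not part of the package and do not reproduce what wordtree does internally.

```r
# Hypothetical helper (not part of any package): list every occurrence
# of a keyword with up to `window` tokens of context on each side.
kwic_sketch <- function(text, keyword, window = 3) {
  # Naive tokenization: lowercase, then split on spaces and punctuation
  tokens <- strsplit(tolower(text), "[[:space:][:punct:]]+")[[1]]
  tokens <- tokens[tokens != ""]
  hits <- which(tokens == keyword)
  vapply(hits, function(i) {
    left  <- tail(tokens[seq_len(i - 1)], window)   # tokens before the hit
    right <- head(tokens[-seq_len(i)], window)      # tokens after the hit
    paste(c(left, toupper(keyword), right), collapse = " ")
  }, character(1))
}

txt <- "La libertad es un derecho. Sin libertad no hay democracia."
kwic_sketch(txt, "libertad", window = 2)
#> [1] "la LIBERTAD es un"           "derecho sin LIBERTAD no hay"
```

A flat listing like this is enough to spot ambiguous uses of a word; the tree visualization adds the grouping of shared prefixes and suffixes.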

Salience analysis

In other cases, users want to determine which texts contain a given keyword more frequently. Imagine a set of ten thousand laws and a researcher who only wants to analyze those in which the word “tax” is prevalent. In this case, we can use the function tfRatio (term-frequency ratio), which returns the ratio between the frequency of a word in a text and the frequency of the same word in the entire corpus. Values well above one therefore indicate that the word is more frequent in the text than in the corpus as a whole, and values below one indicate that it is less common.

For instance, we can use tfRatio to determine which texts in the corpus cp contain words starting with “machis” (machismo, machista):


library(quanteda)
cp <- corpus(spa.inaugural)

# Calculate the ratio for those
# texts with the root "machis"
tt <- tfRatio(cp, "machis")

# Display the name of the documents
# containing words starting with "machis"
as.character(docid(cp)[tt > 0])
#> [1] "Zapatero II" "Sánchez I"   "Sánchez II"  "Sánchez III"

We can verify that the only matches are speeches by the two most recent presidents from the PSOE party: José Luis Rodríguez Zapatero and Pedro Sánchez.
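To make the ratio concrete, here is a minimal base R sketch of the calculation described above. It assumes that “frequency” means the share of a document's tokens that start with the given root; the helper tf_ratio_sketch and the toy documents are hypothetical and are not the package's implementation of tfRatio.

```r
# Hypothetical helper, not the package's code: share of tokens starting
# with `root` in each document, divided by the same share in the pooled
# corpus. Returns NaN for every document if the root never occurs.
tf_ratio_sketch <- function(texts, root) {
  tokens <- lapply(strsplit(tolower(texts), "[[:space:][:punct:]]+"),
                   function(tk) tk[tk != ""])
  hits  <- vapply(tokens, function(tk) sum(startsWith(tk, root)), integer(1))
  sizes <- lengths(tokens)
  (hits / sizes) / (sum(hits) / sum(sizes))
}

# Toy corpus invented for illustration
docs <- c(a = "tax law on income tax",
          b = "law on public health",
          c = "tax reform and health reform law")

round(tf_ratio_sketch(docs, "tax"), 2)
#>    a    b    c 
#> 2.00 0.00 0.83
```

Document a uses the root twice as often as the corpus as a whole, document b never uses it, and document c falls slightly below the corpus average.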

Other functions perform the opposite operation: they start from a reference text and calculate the relevance of each word. The function plotKeyness calculates the keyness of the words in a corpus, a measure of how strongly each word distinguishes a reference category from the rest of the corpus. It requires a corpus and a reference category (the ref.cat argument in the example below) and returns a visualization of the words that are most relevant in the reference category compared to the other texts.

The closer to zero, the less relevant the word; the further from zero, the more relevant. Words in the center-left of the chart are the least salient. They appear only a few times and are not particularly useful to distinguish the reference text from other texts in the corpus. Words located at either the top-right or the bottom-right are the most salient. They are frequent and highly informative to distinguish the reference text from other texts in the corpus.


# Select session no. 124 of the Spanish parliament,
# which discussed the law against sexual violence.
spa <- spa.sessions[spa.sessions$session.number == 124, ]

# Aggregate Spanish parliamentary speeches
# by party
re <- aggregate(list(text=spa$speech.text), 
                by=list(rep.party=spa$rep.party),
                FUN=paste, 
                collapse="\n")

# Create a corpus object with the speeches
ci <- corpus(re)

# Group the corpus by party
ci <- corpus_group(ci, groups = rep.party)

# Plot the keyness (log-odds ratio) of the words 
# in the speeches
plotKeyness(corpus = ci,
            type = "log", 
            ref.cat = "Podemos", 
            title = "")

The words that Podemos used most in session no. 124 of the Spanish parliament are shown in blue: “violencias”, “feministas”, “impunidad”, “logotipo”, “Irene” (a reference to the Minister of Gender Equality, Irene Montero). The words that Podemos used relatively less are shown in red: “violencia”, “votos”, “derechos”, “sexual”, “libertad”, “votación”, among others.
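As a rough illustration of what a log odds ratio measures, the base R sketch below compares word counts in a target text against a reference text. The helper log_odds_sketch, the 0.5 smoothing constant, and the toy texts are assumptions made for this example; the statistic that plotKeyness computes with type = "log" may be defined differently.

```r
# Hypothetical helper: log odds ratio of each word in a target text
# against a reference text. Adding 0.5 to every cell is an assumption
# made here to avoid log(0) for words absent from one side.
log_odds_sketch <- function(target, reference) {
  tokenize <- function(x) {
    tk <- strsplit(tolower(x), "[[:space:][:punct:]]+")[[1]]
    tk[tk != ""]
  }
  tg <- tokenize(target)
  rf <- tokenize(reference)
  vocab <- sort(unique(c(tg, rf)))
  a <- as.numeric(table(factor(tg, levels = vocab)))  # counts in target
  b <- as.numeric(table(factor(rf, levels = vocab)))  # counts in reference
  lor <- log((a + 0.5) / (length(tg) - a + 0.5)) -
         log((b + 0.5) / (length(rf) - b + 0.5))
  sort(setNames(lor, vocab), decreasing = TRUE)
}

# Toy texts invented for illustration
res <- log_odds_sketch("violencias feministas violencias impunidad",
                       "votos derechos votos libertad")
names(res)[1]             # most over-represented word in the target
names(res)[length(res)]   # most under-represented word in the target
```

Positive values mark words that are over-represented in the target text and negative values mark words that are over-represented in the reference, which corresponds to the two colors in the chart.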

Positional analyses

In other cases, users want to determine where a word appears in a text or corpus. For instance, a researcher may want to locate the word “democracy” in the speeches of the Spanish parliament. The function plotLexDiv generates a lexical dispersion plot: it takes a corpus and a keyword as arguments and returns a visualization of the positions at which the keyword occurs in each text.


# Create a lexical dispersion plot for
# the word "democracia" in the 
# inaugural speeches of Spanish presidents
plotLexDiv(corpus = cp,
           docvar = "doc_id",
           keyword = "democracia")

The same graph could be employed with a dictionary to highlight themes instead of individual keywords:


library(quanteda)

dic <- dictionary(
  list(democracia=c("democracia","democrát"),
       libertad=c("libertad"),
       igualdad=c("igualdad","equidad"))
)

# Create a lexical dispersion plot for
# the dictionary in the 
# inaugural speeches of Spanish presidents
plotLexDiv(corpus = cp,
           docvar = "doc_id",
           palette = pal$cat.awtools.spalette.6[1:3],
           keyword = dic)
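The data behind a lexical dispersion plot are simply the positions of the matching tokens within each document. The base R sketch below computes them for a toy example; the helper positions_sketch and its tokenizer are hypothetical and do not reflect how plotLexDiv is implemented.

```r
# Hypothetical helper: relative position of every token that starts
# with `root`, per document. Token i in a document of n tokens gets
# the position i / n, so values lie in (0, 1].
positions_sketch <- function(texts, root) {
  lapply(texts, function(x) {
    tk <- strsplit(tolower(x), "[[:space:][:punct:]]+")[[1]]
    tk <- tk[tk != ""]
    which(startsWith(tk, root)) / length(tk)
  })
}

# Toy documents invented for illustration
docs <- c(d1 = "democracia y libertad para la democracia",
          d2 = "libertad e igualdad")

positions_sketch(docs, "democra")
```

Here d1 yields two positions (the first and the last token) and d2 yields none. Plotting each position as a tick on a horizontal line per document gives the familiar dispersion layout.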

Finally, we can use the functions filterWords and plotSpike to create lexical dispersion plots for a large corpus. As an example, we will apply the same dictionary (democracia, igualdad, libertad) to all sessions of the Spanish parliament. First, we aggregate all parliamentary speeches by session and create a corpus object with the results. Next, filterWords locates the keywords in each session. Finally, plotSpike creates an interactive plot that shows the position of the keywords in the corpus, colored by dictionary category. The plot below shows the 19,428 matches for the three categories in the dictionary across a total of 262 parliamentary sessions.

Since the plot is interactive and includes a large number of elements, it is preferable to copy the code below and run it directly in the console.


ag <- aggregate(list(text=spa.sessions$speech.text),
                by=list(session=spa.sessions$session.number),
                FUN=paste, 
                collapse="\n")

# Pad the session numbers with leading zeros
# so that the sessions sort correctly
ag$session[nchar(ag$session)==1] <-
  paste0("00", ag$session[nchar(ag$session)==1])

ag$session[nchar(ag$session)==2] <-
  paste0("0", ag$session[nchar(ag$session)==2])

# Create a corpus with the results
cs <- corpus(ag, docid_field = "session")

# Search the keywords and their
# position in each session
ter <- filterWords(cs, dic)

# Plot the results
plotSpike(data=ter,
          legend.title="Tema:",
          title="Congreso de los Diputados - XIV Legislatura (2019-2023)",
          subtitle="Democracia, libertad e igualdad en los debates de los plenos.", 
          svg.width = 5,
          svg.height = 5)